34 research outputs found

    Kinematic Model of the Hand using Computer Vision

    Get PDF
    La biotecnología es una ciencia en auge y en especial el diseño de interfaces humano-máquina. El objetivo de este proyecto es avanzar en dicho campo y en concreto explorar el diseño de exoesqueletos y prótesis de la mano humana. La metodología utilizada en este proyecto fundamentalmente consta de tres fases. En primer lugar, se ha establecido un modelo teórico de la cinemática de la mano recurriendo a la documentación médica especializada para concretar su anatomía. Posteriormente se ha procedido a sintetizar la mano en sus parámetros simplificados y así definir un modelo robótico. Para ajustar dicho modelo a una mano real se procede a capturar el movimiento de esta en una secuencia de imágenes mediante ordenador. Para ello se utilizan unas marcas en las uñas de la mano con una geometría específica de tal manera que permite la estimación de su pose, es decir su posición e orientación, en el espacio. Esta secuencia de poses estimadas permite caracterizar el movimiento completo de la mano. Por último, mediante la síntesis cinemática dimensional, se definen las ecuaciones de movimiento parametrizadas del modelo teórico de la mano. Estas ecuaciones permiten ajustar el modelo a la secuencia de poses estimadas mediante visión por ordenador y así crear un modelo personalizado de la mano. Gracias a este sistema, se puede realizar un estudio sobre la correspondencia entre señales electomiográficas y los movimientos de la mano y así lograr una mejor funcionalidad de las prótesis. En definitiva, este proyecto ha logrado diseñar un algoritmo robusto para el seguimiento y estimación de las poses de las uñas de las manos y ha conseguido definir las ecuaciones de movimiento y crear una aplicación para resolverlas. Asimismo, ha encontrado modelos no antropomórficos que podrían ser de utilidad en el diseño de exoesqueletos

    Understanding human-centric images : from geometry to fashion

    Get PDF
    Understanding humans from photographs has always been a fundamental goal of computer vision. Early works focused on simple tasks such as detecting the location of individuals by means of bounding boxes. As the field progressed, harder and more higher level tasks have been undertaken. For example, from human detection came the 2D and 3D human pose estimation in which the task consisted of identifying the location in the image or space of all different body parts, e.g., head, torso, knees, arms, etc. Human attributes also became a great source of interest as they allow recognizing individuals and other properties such as gender or age. Later, the attention turned to the recognition of the action being performed. This, in general, relies on the previous works on pose estimation and attribute classification. Currently, even higher level tasks are being conducted such as predicting the motivations of human behavior or identifying the fashionability of an individual from a photograph. In this thesis we have developed a hierarchy of tools that cover all these range of problems, from low level feature point descriptors to high level fashion-aware conditional random fields models, all with the objective of understanding humans from monocular, RGB images. In order to build these high level models it is paramount to have a battery of robust and reliable low and mid level cues. Along these lines, we have proposed two low-level keypoint descriptors: one based on the theory of the heat diffusion on images, and the other that uses a convolutional neural network to learn discriminative image patch representations. We also introduce distinct low-level generative models for representing human pose: in particular we present a discrete model based on a directed acyclic graph and a continuous model that consists of poses clustered on a Riemannian manifold. As mid level cues we propose two 3D human pose estimation algorithms: one that estimates the 3D pose given a noisy 2D estimation, and an approach that simultaneously estimates both the 2D and 3D pose. Finally, we formulate higher level models built upon low and mid level cues for human understanding. Concretely, we focus on two different tasks in the context of fashion: semantic segmentation of clothing, and predicting the fashionability from images with metadata to ultimately provide fashion advice to the user. In summary, to robustly extract knowledge from images with the presence of humans it is necessary to build high level models that integrate low and mid level cues. In general, using and understanding strong features is critical for obtaining reliable performance. The main contribution of this thesis is in proposing a variety of low, mid and high level algorithms for human-centric images that can be integrated into higher level models for comprehending humans from photographs, as well as tackling novel fashion-oriented problems.Siempre ha sido una meta fundamental de la visión por computador la comprensión de los seres humanos. Los primeros trabajos se fijaron en objetivos sencillos tales como la detección en imágenes de la posición de los individuos. A medida que la investigación progresó se emprendieron tareas mucho más complejas. Por ejemplo, a partir de la detección de los humanos se pasó a la estimación en dos y tres dimensiones de su postura por lo que la tarea consistía en identificar la localización en la imagen o el espacio de las diferentes partes del cuerpo, por ejemplo cabeza, torso, rodillas, brazos, etc...También los atributos humanos se convirtieron en una gran fuente de interés ya que permiten el reconocimiento de los individuos y de sus propiedades como el género o la edad. Más tarde, la atención se centró en el reconocimiento de la acción realizada. Todos estos trabajos reposan en las investigaciones previas sobre la estimación de las posturas y la clasificación de los atributos. En la actualidad, se llevan a cabo investigaciones de un nivel aún superior sobre cuestiones tales como la predicción de las motivaciones del comportamiento humano o la identificación del tallaje de un individuo a partir de una fotografía. En esta tesis desarrollamos una jerarquía de herramientas que cubre toda esta gama de problemas, desde descriptores de rasgos de bajo nivel a modelos probabilísticos de campos condicionales de alto nivel reconocedores de la moda, todos ellos con el objetivo de mejorar la comprensión de los humanos a partir de imágenes RGB monoculares. Para construir estos modelos de alto nivel es decisivo disponer de una batería de datos robustos y fiables de nivel bajo y medio. En este sentido, proponemos dos descriptores novedosos de bajo nivel: uno se basa en la teoría de la difusión de calor en las imágenes y otro utiliza una red neural convolucional para aprender representaciones discriminativas de trozos de imagen. También introducimos diferentes modelos de bajo nivel generativos para representar la postura humana: en particular presentamos un modelo discreto basado en un gráfico acíclico dirigido y un modelo continuo que consiste en agrupaciones de posturas en una variedad de Riemann. Como señales de nivel medio proponemos dos algoritmos estimadores de la postura humana: uno que estima la postura en tres dimensiones a partir de una estimación imprecisa en el plano de la imagen y otro que estima simultáneamente la postura en dos y tres dimensiones. Finalmente construimos modelos de alto nivel a partir de señales de nivel bajo y medio para la comprensión de la persona a partir de imágenes. En concreto, nos centramos en dos diferentes tareas en el ámbito de la moda: la segmentación semántica del vestido y la predicción del buen ajuste de la prenda a partir de imágenes con meta-datos con la finalidad de aconsejar al usuario sobre moda. En resumen, para extraer conocimiento a partir de imágenes con presencia de seres humanos es preciso construir modelos de alto nivel que integren señales de nivel medio y bajo. En general, el punto crítico para obtener resultados fiables es el empleo y la comprensión de rasgos fuertes. La aportación fundamental de esta tesis es la propuesta de una variedad de algoritmos de nivel bajo, medio y alto para el tratamiento de imágenes centradas en seres humanos que pueden integrarse en modelos de alto nivel, para mejor comprensión de los seres humanos a partir de fotografías, así como abordar problemas planteados por el buen ajuste de las prendas

    Multi-modal joint embedding for fashion product retrieval

    Get PDF
    © 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Finding a product in the fashion world can be a daunting task. Everyday, e-commerce sites are updating with thousands of images and their associated metadata (textual information), deepening the problem, akin to finding a needle in a haystack. In this paper, we leverage both the images and textual meta-data and propose a joint multi-modal embedding that maps both the text and images into a common latent space. Distances in the latent space correspond to similarity between products, allowing us to effectively perform retrieval in this latent space, which is both efficient and accurate. We train this embedding using large-scale real world e-commerce data by both minimizing the similarity between related products and using auxiliary classification networks to that encourage the embedding to have semantic meaning. We compare against existing approaches and show significant improvements in retrieval tasks on a large-scale e-commerce dataset. We also provide an analysis of the different metadata.Peer ReviewedPostprint (author's final draft

    Geodesic finite mixture models

    Get PDF
    We present a novel approach for learning a finite mixture model on a Riemannian manifold in which Euclidean metrics are not applicable and one needs to resort to geodesic distances consistent with the manifold geometry. For this purpose, we draw inspiration on a variant of the expectation-maximization algorithm, that uses a minimum message length criterion to automatically estimate the optimal number of components from multivariate data lying on an Euclidean space. In order to use this approach on Riemannian manifolds, we propose a formulation in which each component is defined on a different tangent space, thus avoiding the problems associated with the loss of accuracy produced when linearizing the manifold with a single tangent space. Our approach can be applied to any type of manifold for which it is possible to estimate its tangent space. In particular, we show results on synthetic examples of a sphere and a quadric surface and on a large and complex dataset of human poses, where the proposed model is used as a regression tool for hypothesizing the geometry of occluded parts of the body.Peer ReviewedPostprint (published version

    Design of non-anthropomorphic robotic hands for anthropomorphic tasks

    Get PDF
    In this paper, we explore the idea of designing non- anthropomorphic multi-fingered robotic hands for tasks tha t replicate the motion of the human hand. Taking as input data a finite set of rigid-body positions for the five fingertips, we de- velop a method to perform dimensional synthesis for a kinema tic chain with a tree structure, with five branches that share thr ee common joints. We state the forward kinematics equations of relative dis- placements for each serial chain expressed as dual quaterni ons, and solve for up to five chains simultaneously to reach a numbe r of positions along the hand trajectory. This is done using a h y- brid global numerical solver that integrates a genetic algo rithm and a Levenberg-Marquardt local optimizer. Although the number of candidate solutions in this problem is very high, the use of the genetic algorithm allows us to per form an exhaustive exploration of the solution space to obtain a s et of solutions. We can then choose some of the solutions based on t he specific task to perform. Note that these designs match the ta sk exactly while generally having a finger design radically dif ferent from that of the human hand.Peer ReviewedPostprint (author’s final draft

    Multi-modal fashion product retrieval

    Get PDF
    Finding a product in the fashion world can be a daunting task. Everyday, e-commerce sites are updating with thousands of images and their associated metadata (textual information), deepening the problem. In this paper, we leverage both the images and textual metadata and propose a joint multi-modal embedding that maps both the text and images into a common latent space. Distances in the latent space correspond to similarity between products, allowing us to effectively perform retrieval in this latent space. We compare against existing approaches and show significant improvements in retrieval tasks on a largescale e-commerce dataset.Peer ReviewedPostprint (author's final draft

    Multi-modal embedding for main product detection in fashion

    Get PDF
    © 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Best Paper Award a la 2017 IEEE International Conference on Computer Vision WorkshopsWe present an approach to detect the main product in fashion images by exploiting the textual metadata associated with each image. Our approach is based on a Convolutional Neural Network and learns a joint embedding of object proposals and textual metadata to predict the main product in the image. We additionally use several complementary classification and overlap losses in order to improve training stability and performance. Our tests on a large-scale dataset taken from eight e-commerce sites show that our approach outperforms strong baselines and is able to accurately detect the main product in a wide diversity of challenging fashion images.Peer ReviewedAward-winningPostprint (author's final draft

    BASS: boundary-aware superpixel segmentation

    Get PDF
    © 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.We propose a new superpixel algorithm based on exploiting the boundary information of an image, as objects in images can generally be described by their boundaries. Our proposed approach initially estimates the boundaries and uses them to place superpixel seeds in the areas in which they are more dense. Afterwards, we minimize an energy function in order to expand the seeds into full superpixels. In addition to standard terms such as color consistency and compactness, we propose using the geodesic distance which concentrates small superpixels in regions of the image with more information, while letting larger superpixels cover more homogeneous regions. By both improving the initialization using the boundaries and coherency of the superpixels with geodesic distances, we are able to maintain the coherency of the image structure with fewer superpixels than other approaches. We show the resulting algorithm to yield smaller Variation of Information metrics in seven different datasets while maintaining Undersegmentation Error values similar to the state-of-the-art methods.Peer ReviewedPostprint (author's final draft

    DaLI: deformation and light invariant descriptor

    Get PDF
    Recent advances in 3D shape analysis and recognition have shown that heat diffusion theory can be effectively used to describe local features of deforming and scaling surfaces. In this paper, we show how this description can be used to characterize 2D image patches, and introduce DaLI, a novel feature point descriptor with high resilience to non-rigid image transformations and illumination changes. In order to build the descriptor, 2D image patches are initially treated as 3D surfaces. Patches are then described in terms of a heat kernel signature, which captures both local and global information, and shows a high degree of invariance to non-linear image warps. In addition, by further applying a logarithmic sampling and a Fourier transform, invariance to photometric changes is achieved. Finally, the descriptor is compacted by mapping it onto a low dimensional subspace computed using Principal Component Analysis, allowing for an efficient matching. A thorough experimental validation demonstrates that DaLI is significantly more discriminative and robust to illuminations changes and image transformations than state of the art descriptors, even those specifically designed to describe non-rigid deformations.Peer ReviewedPostprint (author's final draft
    corecore